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Precipitation is a complex physical process that varies in space 
and time. Predictions and interpolations at unobserved times and/or 
locations help to solve important problems in many areas. In this 
paper, we present a hierarchical Bayesian model for spatio-temporal 
data and apply it to obtain short term predictions of rainfall. The 
model incorporates physical knowledge about the underlying pro- 
cesses that determine rainfall, such as advection, diffusion and con- 
vection. It is based on a temporal autoregressive convolution with 
spatially colored and temporally white innovations. By linking the 
advection parameter of the convolution kernel to an external wind 
vector, the model is temporally nonstationary. Further, it allows for 
nonseparable and anisotropic covariance structures. With the help 
of the Voronoi tessellation, we construct a natural parametrization, 
that is, space as well as time resolution consistent, for data lying on 
irregular grid points. In the application, the statistical model com- 
bines forecasts of three other meteorological variables obtained from 
a numerical weather prediction model with past precipitation obser- 
vations. The model is then used to predict three-hourly precipita- 
tion over 24 hours. It performs better than a separable, stationary 
and isotropic version, and it performs comparably to a determinis- 
tic numerical weather prediction model for precipitation and has the 
advantage that it quantifies prediction uncertainty. 

1. Introduction. Precipitation is a very complex phenomenon that varies 
in space and time, and there are many efforts to model it. Predictions and in- 
terpolations at unobserved times and/or locations obtained from such mod- 
els help to solve important problems in areas such as agriculture, climate 
science, ecology and hydrology. Stochastic models have the great advantage 
of providing not only point estimates, but also quantitative measures of un- 
certainty. They can be used, for instance, as stochastic generators [Wilks 
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(1998), Makhnin and McAllister (2009)] to provide realistic inputs to flood- 
ing, runoff' and crop growth models. Moreover, they can be applied as com- 
ponents within general circulation models used in climate change studies 
[Fowler et al. (2005)], or for postprocessing precipitation forecasts [Sloughter 
et al. (2007)]. 

1.1. Distributions for precipitation. A characteristic feature of precipi- 
tation is that its distribution consists of a discrete component, indicating 
occurrence of precipitation, and a continuous one, determining the amount 
of precipitation. As a consequence, there are two basic statistical modeling 
approaches. The continuous and the discrete part are either modeled sepa- 
rately [Coe and Stern (1982), Wilks (1999)] or together [Beh (1987), Wilks 
(1990), Bardossy and Plate (1992), Hutchinson (1995), Sanso and Guenni 
(2004)]. Typically, in the second approach, the distribution of the rainfall 
amounts and the probability of rainfall are determined together using what 
is called a censored distribution. Originally, this idea goes back to Tobin 
(1958) who analyzed household expenditure on durable goods. For model- 
ing precipitation, Stidd (1973) took up this idea and modified it by including 
a power-transformation for the nonzero part so that the model can account 
for skewness. 

1.2. Correlations in space and time. For modeling processes that involve 
dependence over space and time, there are two basic approaches [see, e.g., 
Cressie and Wikle (2011)]: one which models the space-time covariance 
structure without distinguishing between the time and space dimensions, 
and a dynamic one which takes the natural ordering in the time dimension 
into account. 

The first approach usually follows the traditional geostatistical paradigm 
of assuming a parametric covariance function [for an introduction into geo- 
statistics, see, e.g., Cressie (1993) or Gelfand et al. (2010)]. Several para- 
metric families specifying explicitly the joint space-time covariance struc- 
ture have been proposed [Jones and Zhang (1997), Cressie and Huang (1999), 
Gneiting (2002), Ma (2003), Stein (2005), Paciorek and Schervish (2006)]. In- 
terpretability and, especially, computational complexity are challenges when 
working with parametric space-time covariance functions. 

There is, however, a fundamental difference between the spatial and the 
temporal dimensions. Whereas there is an order in the time domain, there 
exists no obvious order for space. It is therefore natural to assume a dynamic 
temporal evolution combined with a spatially correlated error term [S0lna 
and Switzer (1996), Wikle and Cressie (1999), Huang and Hsu (2004), Xu, 
Wikle and Fox (2005), Gelfand, Banerjee and Gamerman (2005)]. As Wikle 
and Hooten (2010) state, the dynamic approach can be used to construct 
realistic space-time dependency structures based on physical knowledge. 
Further, the temporal Markovian structure offers computational benefits. 
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1.3. Models for precipitation. Isham and Cox (1994) state that there 
are three broad types of mathematical models of ramfall: deterministic me- 
teorological models [Mason (1986)], intermediate stochastic models [Le Cam 
(1961), Cox and Isham (1988), Waymire, Gupta and Rodriguez-Iturbe (1984)], 
and empirical statistical models. Meteorological models represent as realis- 
tically as possible the physical processes involved. As noted by Kyriakidis 
and Journel (1999), deterministic models typically require a large number 
of input parameters that are difficult to determine, whereas stochastic mod- 
els are usually based on a small number of parameters. Nevertheless, sta- 
tistical models can also incorporate knowledge about physical processes. 
Parametrizations can be chosen based on physical knowledge and covariates 
reflecting information about the physical processes can be included. 

In the following, we briefly review statistical models for precipitation. For 
modeling daily precipitation at a single measuring site. Stern and Coe (1984) 
use a nonstationary second-order Markov chain to describe precipitation oc- 
currence and a gamma distribution to describe rainfall amounts. Hughes 
and Guttorp (1994) and Hughes, Guttorp and Charles (1999) model pre- 
cipitation occurrence using a nonhomogeneous hidden Markov model. With 
the help of an unobserved weather state they link large scale atmospheric 
circulation patterns with the local precipitation process. Bellone, Hughes 
and Guttorp (2000) and Charles, Bates and Hughes (1999) both extend 
this approach by also modeling precipitation amounts. The former propose 
to use gamma distributions, whereas the latter use empirical distribution 
functions. Ailliot, Thompson and Thomson (2009) present a hidden Markov 
model in combination with the transformed and censored Gaussian distri- 
bution approach used in Bardossy and Plate (1992). Also building on the 
same censoring idea, Sanso and Guenni (1999a) model precipitation occur- 
rence and amount of precipitation using a transformed multivariate Gaussian 
model with a spatial correlation structure. Further works on statistical pre- 
cipitation modeling include Sanso and Guenni (1999b, 2000), Brown et al. 
(2001), Stehlik and Bardossy (2002), AUcroft and Glasbey (2003), Sloughter 
et al. (2007), Berrocal, Raftery and Gneiting (2008) and Fuentes, Reich and 
Lee (2008). 

1.4. Outline. The model presented in the following is a hierarchical 
Bayesian model for spatio-temporal data. At the data stage, we opt for 
a modeling approach that determines the discrete and the continuous parts 
of the precipitation distribution together. This is done by assuming the 
existence of a latent Gaussian variable which can be interpreted as a precip- 
itation potential. The mean of the Gaussian variable is related to covariates 
through a regression term. The advantages of this one-part modeling strat- 
egy are twofold: the model contains less parameters and it can deal with the 
so-called spatial (and temporal) intermittence effect [Bardossy and Plate 
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(1992)] which suggests smooth transitions between wet and dry areas. This 
means that at the edge of a dry area the amount of rainfall should be low. 
Wilks (1998) notes that, indeed, lower rainfall intensity is observed when 
more neighboring stations are dry. This feature also reflects the idea that 
if a model determines a low probability of rainfall for a given situation, it 
should also give a small expected value for its amount conditional on this 
event, and vice versa. However, we note that there is no consensus in the 
literature whether the two parts of precipitation should be modeled together 
or separately. 

At the process level, we use a dynamic model for accounting for spatio- 
temporal variation. The model explicitly incorporates knowledge about the 
underlying physical processes that determine rainfall, such as advection, 
diffusion and convection. Approximating an integrodifference equation, we 
obtain a temporally autoregressive convolution with spatially colored and 
temporally white innovations. The model is nonstationary, anisotropic, and 
it allows for nonseparable covariance structures, that is, covariance struc- 
tures where spatial and temporal variation interact. While our approach 
builds on existing models, it includes several novel features. With the help 
of the Voronoi tessellation, a natural parametrization for data lying on an 
irregular grid is obtained. The parametrization based on this tessellation is 
space as well as time resolution consistent, physically realistic and allows for 
modeling irregularly spaced data in a natural way. To our knowledge, the 
use of the Voronoi tessellation for spatio-temporal data on an irregular grid 
is new. By linking the advection parameter of the kernel to an external wind 
vector, the model is temporally nonstationary. 

The model is applied to predict three-hourly precipitation. The prediction 
model is based on three forecasted meteorological variables obtained from 
an NWP model as well as past rainfall observations. We compare predictions 
from the statistical model with the precipitation forecasts obtained from the 



The remainder is organized as follows. In Section 2 the model specifica- 
tions are presented. In Section 3 it is shown how the model can be fitted 
to data using a Markov chain Monte Carlo (MCMC) algorithm and how 
predictions can be obtained. Next, in Section 4 the model is applied to ob- 
tain short term predictions of three-hourly rainfall. Conclusions are given in 
Section 5. 

2. The model. It is assumed that the rainfall Yt{s) at time t on site 
s = (x, uY € depends on a latent normal variable Wj(s) through 



NWP. 



Yt{s) = 



if Wt(s) <0 
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Wt{s) 



if Wtis) > 0, 
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where A > 0. A power transformation is needed since precipitation amounts 
are more skewed than a truncated normal distribution and since the scatter 
of the precipitation amounts increases with the average amount. The latent 
variable Wj(s) can be interpreted as a precipitation potential. 

This latent variable Wj(s) is modeled as a Gaussian process that is spec- 
ified as 

(2) M^t(s)=xt(sf/3 + 6(s) + i^i(s), 

where /3 € R'', and ut{s) ~ A^(0, r^),T^ > 0, are i.i.d. The mean Xf(s)'^/3 of 
Wt{s) is assumed to depend linearly on regressors xt(s) € M.^. For notational 
convenience, we split the terms specifying the covariance function into a 
structured part ^i(s) and an unstructured "nugget" ft(s). The term Ct(s) is 
a zero-mean Gaussian process that accounts for structured variation in time 
and space. It is specified below in Section 2.1. The nugget ft(s) models mi- 
croscale variability and measurement errors. Since, typically, the resolution 
of the data does not allow for distinguishing between microscale variability 
and measurement errors, we model these two sources of variation together. 
Note that the covariates xj(s) will usually be time and location dependent. 
In addition to weather characteristics, Fourier harmonics can be included to 
account for seasonality, and functions of coordinates can account for smooth 
effects in space. 

2.1. The convolution autoregressive model. We follow the dynamic ap- 
proach and define an explicit time evolution through the following integrod- 
ifference equation (IDE): 

(3) 6(s)= / /i^(s-s')6-i(s')ds' + et(s), seM^ 

where St{s) is a Gaussian innovation that is white in time and colored in 
space, and is a Gaussian kernel, 

(4) h^{s - s') = ^exp{-{s - s' - fXtf-S-^s - s' - /i,)), 

where the parameter vector tD combines (j) and the elements of /x^ and 5]"^. 
Note that fi^ shifts the kernel and determines the range and the degree 
of anisotropy. The parameter (p controls the amount of temporal correlation. 
More details on the interpretation of the model and specific choices of the 
parameters /Xj and 5] are discussed below in Section 2.2. An illustration of 
this kernel can be found in the application in Section 4.2. 

In the following, we assume that we have measurement locations Sj, 
i = 1, . . . ,N, where measurements are made at times t = 1, . . . , T. Instead 
of working with a fine spatial grid with many missing observations, we 
formulate an approximate model for the values at the stations only, = 
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,^j(sAr))'. Discretizing the integral in (3), we obtain 

/ /i^(s,-s')et-i(s')c?s'« / h^{s,-s%.i{s')ds' 
Jr^ J a 

N 

a^h^{si - Sj)^t^i{sj)\Aj\. 
i=i 

Here ^ C is an area which contains the convex hull of all stations, the 
sets Ai,i = 1, . . . ,N , form a tessellation of A with Sj € Ai and \Aj \ denotes 
the area of cell Aj. 

Our model can then be written as the vector autoregression 

(6) ^, = 0a^,_i + et, G^GM^^^, 
where 

(7) iGt)ij = exp(-(si - s^- - /xj^5]~^(si - s^- - /i^)) • \Aj\, 

and where £t = (et(si), . . . , (sat))'. 

Note that this process does not exhibit explosive growth if the largest 
eigenvalue of (j)Gt is smaller than one. To ensure this, we check in our ap- 
plication that the largest eigenvalue is smaller than one for the parameters 
at the posterior modes. 

If the Sj's form a regular grid, a tessellation is straightforward. Otherwise, 
we propose to use the Voronoi tessellation [Voronoi (1908)] which decom- 
poses the space. Specifically, each site Sj has a corresponding Voronoi cell 
consisting of all points closer to Sj than to any other site Sj, j ^ i [see, 
e.g., Okabe et al. (2000) for more details]. Stations on the boundary of the 
convex hull have cells with infinite area. For these stations, we define \Ai\ 
as described in the following. We first calculate the Voronoi tessellation of 
M?. We then replace unbounded cells by cells whose area is the average area 
of the neighboring bounded cells. In Figure 1, the Voronoi tessellation for 
the Swiss stations used in the application below is shown as an example. 
Concerning the stations on the boundary, the circles represent the surface 

SiTQQi I .A-i I . 

As mentioned before, the St's are assumed to be independent over time 
and colored in space. More precisely, we assume a stationary, isotropic Gaus- 
sian random field 

(8) et^N{0,a^Vp,), a^>0, 
with 

(9) (Vpo)ij. =exp(-dij//3o), Po>0, l<i,j,<N, 

where dij denotes the Euclidean distance between two sites i and j. The 
exponential correlation function is used for computational convenience. In 



(6(si),... 

(5) 
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Fig. 1. Locations of stations. Both axes are in km using the Swiss coordinate system 
(CH1903). The lines illustrate the Voronoi tessellation. Cells with unbounded area have 
been replaced by circles whose area is determined as described in the text. 

principle, it is possible to use other covariance functions, for instance, other 
members of the Matern family. 

The approximation in (5) assumes that /i^ is approximately constant in 
each cell. If some cells are considered to be too large for this approximation 
to be reasonable, additional points can be added for which all observations 
are missing. Since such additional points increase the computational load, 
some compromise has to be found between accuracy and computational 
feasibility. 

2.2. Interpretation and parametrization of the kernel function. For the 
purpose of interpretation, we note that, in the limit when the temporal 
spacing goes to zero, the solution of the IDE (3) can also be written as the 
solution of the stochastic partial differential equation (SPDE) [see Brown 
et al. (2000)] 

(10) ^6(s) = -t^t ■ V6(s) + • I]Vet(s) - r?et(s) + Bt{s), 

where V = (^, ^) is the gradient operator and where -Bj(s) is temporally 
independent and spatially dependent. The terms have the following inter- 
pretations: /x^ • V.^t(s) models advection, fj,^ being a drift or velocity vector. 
The second term is a diffusion term that can incorporate anisotropy, and 
— 77^t(s) accounts for damping. The damping parameter i] is related to (j) and 
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S through T] = — log((;(!)7r|Sj"'^/^). Bt{s) is a source-sink or stochastic forcing 
term that can be interpreted as modehng convective phenomena. This inter- 
pretation is based on the reasoning that typically convective precipitation 
cells emerge and cease on the domain of interest in contrast to larger scale 
advective precipitation that is being transported over the area. 

We now turn to the discussion of the parameterization of /x^ and XI. In our 
application, we have information about wind. It is assumed that the drift 
term fj,^ is proportional to this external wind vector. With fi^ varying over 
time, the model is temporally nonstationary. It is also conceivable that in 
certain situations S or t/ may vary over time and/or space, thus obtaining 
different forms of nonstationarity. Concerning S, it is thought that potential 
anisotropy is related to topography. Denoting by the wind vector at time 
t, we assume 

fj,f = u -wt and 

. T 

,_i _ 1 f cosa sina \ / cosa sina 
-c-sina c-cosa/ \— c-sina c-cosa 



^' - 2 
Pi 



where li € M, c > 0, and a € [0, 7r/2]. We use a wind vector which is averaged 
over the entire area, but the wind could also change locally. The motiva- 
tion for writing 5] in the given form comes from considering a coordinate 
transformation 

f _ f cosQ sina \ f ^ 
\y'J c-sina c-cosq J \y 

where the parameter a is the angle of rotation, and c determines the degree 
of anisotropy, c = 1 corresponding to the isotropic case, pi is a range param- 
eter that determines the degree of interaction between spatial and temporal 
correlation. See Section 4.2 for an illustration of a kernel with the above 
parametrization. 

The resulting model is nonstationary and incorporates anisotropy. Finally, 
we note that there are various other possible choices of parametrizations. For 
instance, a relatively simple model can be obtained by assuming 

(13) /x, = and S-i = ^(^J J 

that is, no drift and an isotropic diffusion term. There is still spatio-temporal 
interaction, though, which implies that the model is not separable in the 
sense that (16) does not hold. We can simplify further and take not only 
Hi = 0, but also S = 0, leading to being the identity matrix 

(14) ^t = <P^t^i+^t. 

This means that each point at time t — 1 only has an influence on itself 
at time t, that is, there is no spatio-temporal interaction and the model is 
separable. 
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2.3. Discussion of the model. 

Propagator matrix Gt- Using a parametrized propagator matrix Gt in 
(6) has the obvious advantage that less parameters are needed than in the 
general case, in which each entry in the matrix has to be estimated, resulting 
in N'^ parameters. Moreover, in contrast to the general case, the parametric 
approach allows for making predictions at sites where no measurements are 
available, which is often of interest in applications. 

Space resolution consistency. At first sight, it might be tempting to use 
a simpler parametrization of Gt not based on a convolution but of the form 



However, such a model has the following important drawback. Assume, for 
instance, that a station i is surrounded by two neighboring sites j and k. Say 
that both stations j and k lie at the same distance from i but in different 
directions. Consequently, j and k at time t — 1 exercise the same influence 
on i at time t. If one adds an additional station I very close to k, the joint 
influence of k and I at time t — 1 on site i at time t would then approximately 
be twice as big as the one of site j. This means that the distribution of the 
process at point i depends on the number and the location of stations in the 
neighborhood at which it has been observed. The convolution model, on the 
other hand, does not exhibit this drawback. Furthermore, the convolution 
model has the advantage that it is "space resolution consistent," that is, it 
retains approximately its temporal Markovian structure if one, or several, 
sites are removed from the domain. This does not hold true for the simpler 
vector autoregressive model as specified in (15). 

Space-time covariance structure. In the following, let us turn to the 
spatio-temporal dependence structure of the latent process A random 
field ^f(s), (s,t) G X M is said to have a separable covariance structure 
[Gneiting, Genton and Guttorp (2007)] if there exist purely spatial and 
purely temporal covariance functions Cs and Ct, respectively, such that 



The convolution based approach allows for nonseparable covariance struc- 
tures, whereas the separable autoregressive model in (14) has a separable 
covariance structure. 

Extremal events. For the data model as specified in equation (1) , Hernan- 
dez, Guenni and Sanso (2009) showed that the distribution of the max- 
ima is a Gumbel. If the focus lies on extremal events, other distributions, 
which have Frechet maxima, can be used, for instance, a t-distribution. 
The t-distribution is particularly attractive since it is a scale mixture of 
normal distributions. To be more specific, if St has a Xdf distribution, 

then Wt = xj'/? + (^^ + Ut)/\/ St/df has a multivariate t-distribution. This 



(15) 




(16) 
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means that the fitting algorithm introduced below can be extended to the 
t-distribution case by introducing an additional latent variable St- 

3. Fitting and prediction. Fitting is done using a Markov chain Monte 
Carlo method (MCMC), the Metropolis-Hastings algorithm [Metropolis et al. 
(1953), Hastings (1970)]. Concerning most parameters, it will be shown that 
the full conditionals are known distributions. Therefore, Gibbs sampling 
[Gelfand and Smith (1990)] can be used in these cases. 

For convenience and later use, we combine the parameters characteriz- 
ing the model into a vector 6 = (A,/3',r^,cr^,pO)''^')' ^^'^ call them pri- 
mary parameters. Our goal is to simulate from the joint posterior distri- 
bution of these parameters and the latent variables ^ = {^i, . . . ,^7^),^o> 
W = (Wi, . . . , Wt^). We note that those Wf (sj) that correspond to observed 
values above zero are known. In that case the full conditional distribution 
consists of a Dirac distribution at Yt{siy^^. For handling the censored values 
and for allowing for missing values, we adopt a data augmentation approach 
[Smith and Roberts (1993)] as specified below in equation (19). See Sec- 
tion 3.1 for more details. 

Assuming prior independence among the primary parameters, the prior 
distributions are specified as 

P[A, (3, r^, 0-2, po, 4>, u, pi,a, c, $q] 

(17) 

with ^0 having a normal prior P[^q1(7^, po] = A^(0, cr^Vpp). Further, po and 
pi have gamma priors with mean pp and variance a p. For c, we assume a 
gamma prior with mean 1 and variance 1, a has a uniform prior on [0,7r/2], 
and u has a normal prior with mean and variance 10^. Further, we assume 
locally uniform priors on log(T'^) and log((7^) as well as for (p, A and (3. 

In our application, we choose to use informative priors for po and pi. It 
is known that in model-based geostatistics difficulties can arise when esti- 
mating the variance and scale parameters of the exponential covariogram 
[see, e.g., Warnes and Ripley (1987), Mardia and Watkins (1989), Diggle, 
Tawn and Moyeed (1998)]. For the geostatistical covariance model, Zhang 
(2004) shows that the product of the two parameters can be estimated con- 
sistently, and Stein (1990) shows that it is the product of the two parameters 
that matters more than the individual parameters for spatial interpolation. 
Further, Berger, De Oliveira and Sanso (2001) show that, at least in the 
simplest setting, the posterior of the range parameters is improper for most 
noninformative priors. Given these considerations, we think that using in- 
formative priors for the two range parameters po and pi is appropriate. In 
our example, we chose priors with mean pp = 100 and variance = 10. We 
have tried different informative priors. The less informative they are, the 
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worse are the mixing properties of the MCMC algorithm. In hne with the 
results of Stein (1990) and Zhang (2004), we have made the experience that 
different choices of priors on these range parameters do not have a strong 
impact on the predictive performance of the model. 
The posterior distribution is then proportional to 

-. X 7V(T+l)/2+l / -I \ NT/2+1 xi/A-1 
^ \ / ^ \ \-(T+l)/2 TT ^t{Si) 



(^) iv„r'--". n 



A 

Yt{si)>0 



(18) 



X exp(-^^^;,V^-i^o) • P[Po] ■ Pm ■ l{m(s.) 



<0 \/i,t:Yt{si)=0}- 

The product in the first line is the Jacobian for the power transformation 
in (1). Note that missing observations do not cause any problem. If Yj(sj) 
is missing, there is no respective term in the product nor a corresponding 
condition for the indicator function. 

3.1. Full conditional distributions. In the following, we derive full con- 
ditional distributions for the individual parameters. 

It is readily seen that the full conditional of /3 is a multivariate normal 
distribution, and the full conditional distribution of (/> is a normal distri- 
bution as well. The full conditionals of both o"^ and are inverse gamma 
distributions. 

For obtaining the full conditionals of W^, we partition its components 
according to whether Yt{si) is above zero, equal to zero, or missing. Denote 

by 4"*"^ those indices for which Yj(sj) > 0, by if'^ those with It(sj) = 0, and 
by ii the missing ones. The vector Wj can then be partitioned into , 
wj'^^, and wj*"' accordingly. We remark that wj"' and wl*"' are latent 
variables, whereas w|^^ corresponds to transformed observed values. In ad- 
dition, w|^' has the restriction that all its values must be smaller than zero, 
wj*^^ < 0. For facilitating understanding, we note that Wt{si) can be written 
as 

Wt{s^) = Wi-^\si) = YtisiY/^ if Yt{s,) > 
(19) =wi^\s,) ifYt{s,) = 

= wj:"^\si) if Yt{si) is missing. 
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The full conditional of W| is then a multivariate normal distribution with 
mean and covariance 

(20) /x^M = (x?"/3 + ^j['"] and S^m = • I. 

Similarly, the full conditional distribution of wj*^' is a truncated multivariate 
normal distribution with mean and covariance 

(21) ^^^[o]=(xf/3 + ^,)[°l and = • I. 

As mentioned before, the full conditional of w[^^ is a Dirac distribution 
with point mass at ( y|^^ ) . 

Concerning the latent variables (^0'^i> • ■ • j^t)) 'we note that conditional 
on G, (^t, Wf) is a linear Gaussian state space model. Therefore, a sample 
from the joint full conditional of {^o,^i, ■ ■ ■ ,^t) can be obtained using the 
forward filtering backward sampling (FFBS) algorithm proposed by Carter 
and Kohn (1994) and Friihwirth-Schnatter (1994). The forward filtering step 
corresponds to the Kalman filter [see, e.g.. West and Harrison (1997) and 
Kiinsch (2001)]. 

Alternatively, one can also use single t updates. The full conditional of 
one , < t < T, is a normal distribution A^(^^^,S^J. In the case of the 
separable model, the mean fj,^^ depends on and ^^+1, whereas the co- 
variance matrix does not depend on t. This is convenient for simulation 
since its Cholesky decomposition has to be calculated only once in each up- 
date cycle. In contrast, in the sampling step of the FFBS algorithm, one 
has to calculate a Cholesky decomposition for each t. The advantage that 
the FFBS algorithm mixes better than the single t update algorithm per 
update cycle is outweighed by the fact that an update cycle of the single 
t update algorithm is a lot faster than one of the FFBS algorithm. Thus, 
more effective samples can be obtained with the single t update algorithm 
per time. In the case of the nonstationary anisotropic drift model, though, 
Xl^^ in the single t update algorithm is not constant over time. Thus, a 
Cholesky decomposition needs to be computed for each t anyway, meaning 
that the FFBS algorithm is preferable. 

In summary, we made the experience that it is recommendable to use 
single t updates for temporally stationary models where the covariance 
of the full conditional of one is constant over time. If Sl^^ changes over 
time, we recommend using the FFBS algorithm. 

For the remaining parameters, that is, po, ■& (excluding cj)) and A, there 
is no apparent distribution family from which one can simulate. Therefore, 
Metropolis steps will be used. We note that the full conditional distribution 
of A is proportional to 

(22) n (^^^^x^)-p(-^ E 
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The parameter A is sampled on the log-scale. This means that we first trans- 
form it to the log scale. Then a proposal is obtained by sampling from a 
normal distribution with the mean equal to the last value of the parameter. 
Thereafter, this proposal is accepted with a probability that is given by the 
usual Metropolis-Hasting algorithm [see, e.g., Chib and Greenberg (1995)]. 

Finally, po and i9 (excluding (p) are sampled together. The full conditional 
is proportional to 




(T+l)/2 



3.2. Prediction. We consider predictions at new locations and/or times 
as well as predictions of areal averages. It turns out that in the case of areal 
averages, the Voronoi tessellation is again useful. 

One way to obtain predictions is to augment the data Yobs with missing 
values at the locations or times where predictions are made. When doing so, 
the MCMC algorithm implicitly draws from the corresponding predictive 
distribution. See the previous Section 3.1 on how to handle missing values. 

If one does not specify the points in space and time where predictions are 
to be made prior to model fitting, the predictive distribution of a new set of 
observations Y* = (V^t (s*), . . . , l^J (s^)' is calculated as 

p[Y*|Yobs] = j P[Y*\CMP[C\iMP\iM'^ohs\ded^de 
(24) ^ — Y" [ p\Y*\$; ,e^%p\c\$.^^ ,o^^]dC 

1=1 

^ m 

yp[Y*ir«,6>W], 



m ■ 

i=l 



where Yobs denotes the observed data, ^ and ^* the latent Gaussian process 
at the observed and predicted sites, respectively, and all the remaining 
parameters. Samples 0^*-* and ^^^\i = l,...,m, from their posterior distri- 
bution are obtained by the MCMC algorithm, and ^*^*^ is sampled from 

When ^* is modeled at the same sites as ^ but at different time points, 
the distribution 0^*^] is Gaussian and readily obtained using (6). 

In the case when predictions are made at unobserved sites s € S and time 
t, P[^t\^^9] can be calculated as described in the following. First, because of 
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the temporal Markov property, -P[^j is equal to P[^t\^t-i^^t^^t+i^^]- 
This density is then obtained by considering the augmented model 

where GJ' is defined analogously to (7), H^+i and H*^;^ are obtained from 
the same approximations as in (5), and the covariances of et and are 
as in (9). By (25), the conditional distribution of given is 

normal. Therefore, also the conditional distribution of given ^f_i,$f,$f_^i 
is Gaussian. Its mean and covariance can be computed by noting that 

P[et\^t-i,^t,^t+i,o]^p[^t+,\C,^t,e]p[et\^t_„^^,e] 

(26) 

and then completing the square in the exponent of the last expression. 
In many cases, for instance, when the focus lies on flooding, areal averages 

(27) y/^*) = ^^^y,(s)ds 

of precipitation are of interest. If Yt{s) is observed on an irregular grid, 
one could first define a regular grid, then interpolate the nonobserved grid 
points, and approximate the integral in (27) by a Riemann sum. However, 
since the regular grid usually becomes very large, this is computationally 
expensive. Instead, we propose to use the Voronoi tessellation once again to 
approximate the integral 

If 1 ^ 

(28) ^i^'^ = TA^\ Y,{s)ds^-j-Y,Yt{sM^r^A*\. 

\ \ JA* II -^^ 

Thereby, an adequate weight \Aj r\A*\ is given to each station. Samples from 

the predictive distribution of Y^^ ^ can be obtained by simulating Y^^*^ (sj ) 
from their predictive distribution and inserting them in (28). 

We note that the areal prediction becomes deterministic if all Yt{sj) con- 
sist of observed values. This means that uncertainty about values of Yt{s) 
at locations where no observations are made is implicitly ignored with the 
above approximation. This can be amended for by first making predictions 
at a few sites where no observations were made. Inserting additional unob- 
served sites can also be useful in other cases. For instance, if A* cuts off a 
substantial part of any Aj, that is, Aj r\A* is much smaller than Aj but not 
empty, the areal prediction might be improved by replacing Yt{sj) by the 
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prediction of Yj at the center of gravity of Aj CiA*, or if the area A* is smah 
and contains only a few stations, improved predictions of the areal average 
can be obtained by making predictions at a few additional points inside the 
area. 

4. Application to short term prediction of precipitation. We apply the 
model to obtain short term forecasts of precipitation. Such forecasts are im- 
portant, for instance, for agriculture and flooding. The traditional way for 
obtaining precipitation forecasts is the use of numerical weather prediction 
(NWP) models. NWP models solve complex, nonlinear equations emulat- 
ing the dynamics of the atmosphere. Typically, NWP models require a lot 
of computational resources to run. Fitting our statistical model using the 
MCMC algorithm presented above is also computationally intensive. How- 
ever, once the statistical model is fitted and assuming that the posterior of 
the primary parameters does not change (see Section 4.3 for more details), 
predictions are computationally a lot cheaper. Furthermore, the statistical 
model can be used in situations where there are no NWP models available 
or to obtain predictions at different temporal resolutions than the one at 
which the NWP model operates. 

4.1. The data. The data consists of three-hourly precipitation amounts 
collected by 26 stations around the Swiss Plateau from the beginning of 
December 2008 to the end of March 2009, making a total of 968 time periods. 
The data were provided by MeteoSwiss. We use the first three months, 
consisting of 720 time periods, for fitting the model. The remaining month 
March, consisting of 248 time periods, is set aside for model evaluation. 
The locations of these stations are shown in Figure 1. In Figure 2, a time 
series plot of the observed precipitation at one station (corresponding to 
the station with the acronym WYN in Figure 1) and of the weighted areal 
average is shown. Concerning the latter, we take the weighted average over 
the entire spatial domain. Figure 3 shows the spatial distribution of the 
precipitation accumulated over time. 

The covariates consist of the x- and y-coordinates (km) , altitude (m) , tem- 
perature (°C), dew point (°C) and specific humidity (%). Specific humidity 
is the ratio of water vapor to dry air in a particular mass. It is expected to 
be positively related to precipitation. The dew point is the temperature to 
which a given parcel of humid air must be cooled, at constant barometric 
pressure, for water vapor to condense into water. Thus, the lower the dew 
point, the lower is the chance for precipitation. However, specific humidity 
and dew point are considerably negatively correlated. This makes it unclear, 
a priori, what their joint relation to precipitation is like. Temperature, dew 
point and specific humidity are predicted variables obtained from an NWP 
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Fig. 2. Precipitation versus time. The lines are observed precipitation of one station 
(corresponding to the station with the acronym WYN m Figure 1) and of the areal average. 
The time axis is in 3 h steps starting at December 1, 2008. The dotted vertical line separates 
the training and test data. 
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Fig. 3. Illustration of the spatial distribution of precipitation. The circles display the 
cumulative rainfall amounts over time at the stations. The larger the circle and the darker 
the color, the higher is the cumulative precipitation amount. Both axes are in km. 
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model called COSMO-2. From the same model, we also obtain wind predic- 
tions (speed is in m/s). Predictions of the statistical model are evaluated 
by comparing them to precipitation forecasts from the same NWP. Having 
a high resolution with a grid spacing of 2.2 km, the NWP model is able to 
resolve convective dynamics. The NWP model produces predictions once a 
day for 24 hours ahead starting at 0:OOUTC. After assimilation and com- 
putation, forecasts are available at around 1:30UTC. For all meteorological 
variables, we use values at approximately 1000 m above ground. This is 
the height where we think these variables to be most influential for pre- 
cipitation. All covariates are centered and standardized to unit variance. 
Centering covariates around their means is used in order to avoid correla- 
tions of the regression coefficients with the intercept and to reduce posterior 
correlations. 

4.2. Fitting and results. In the following, the nonstationary anisotropic 
model incorporating the wind as an external drift term (see Section 2) is 
fitted. In addition, we also fit a separable model. We simulate from the 
posterior distributions of these models as outlined in Section 3. 

After the burn-in period consisting of 5000 draws, 195,000 samples from 
the Markov chain were used to characterize posterior distributions. Conver- 
gence was monitored by inspecting trace plots. 

In Table 1 we show posterior modes as well as 95% credible intervals 
for the different parameters of the nonstationary anisotropic drift model. 



Table 1 

Posterior modes and 95% credible intervals for the nonstationary, anisotropic model with 

an external drift 





Mode 


2.5% 


97.5% 


Intercept 


-1.05 


-1.21 


-0.929 


X 


-0.0473 


-0.133 


0.0541 


Y 


-0.0108 


-0.0846 


0.0531 


Z 


0.00347 


-0.0169 


0.0247 


Temp 


-0.717 


-0.856 


-0.583 


Dew point 


0.406 


0.187 


0.601 


Spec hum 


1.14 


0.949 


1.33 


A 


1.58 


1.54 


1.62 




0.0685 


0.0451 


0.0943 




1.04 


0.953 


1.17 


Po 


92 


86.4 


97.9 




0.000159 


0.000147 


0.00017 


pi 


93.6 


88.1 


99.4 


c 


4.1 


3.61 


4.63 


a 


0.704 


0.658 


0.777 


u 


0.879 


0.645 


1.1 
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t = 429 




-50 50 100 



Fig. 4. Illustration of the convolution kernel at time t = 429. The colors indicate the 
lag-1 influence of the other stations on the station Wynau. The white arrow represents the 
drift caused by a south-west wind at this time point. The dots represent the observation 
stations. The axes are in km. 



The coefficients of the geographic coordinates are not significant. Specific 
humidity has a large positive coefficient. As expected, higher humidity im- 
phes more rainfah. The dew point is also positively related to precipitation. 
Higher temperatures, on the other hand, seem to imply less precipitation. 

For interpreting the fitted parameters governing the convolution kernel 
{pi, c, a and u), we illustrate in Figure 4 the convolution kernel over the 
region where the stations lie. The parameters pi, c, a and u are taken at their 
posterior mode. The plot is interpreted as follows. The height of the kernel is 
the level of influence that $,i_i{s') at location s' has on ^j(s) at location s as 
a function of s' — s. In other words, the colors represent the lag-1 influence 
of the other stations on the station Wynau which is used as origin in the 
plot. The white arrow represents the drift vector fi^ = u-wt at time t = 429, 
wj being the wind vector. Note that this transport vector changes over 
time, thus causing temporal nonstationarity. The time t = 429 illustrates a 
meteorological situation with the typically predominant southwestern wind 
direction. 

With c and a being approximately 4 and 0.7, we observe anisotropy along 
the south-east north-west direction. This corresponds to the topography of 
the region, as the area containing a majority of the stations lies between two 
mountain ranges: the Jura to the north-west and the Alps to the south-east. 
Correlations are expected to be higher along the flat part between these two 
mountain ranges. 

Furthermore, the plot shows how the external drift shifts the convolution 
kernel. Apparently, the southwestern neighbor (Bern) has the highest in- 
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fluence on Wynau in this situation, with wind coming from the southwest. 
Gneiting et al. (2006) observe a similar phenomenon in wind speed data over 
the U.S. Pacific Northwest where there is also a predominant wind direction 
causing asymmetric cross-correlations. 

4.3. Short term prediction of precipitation. In the following, we apply 
the fitted models to produce short term predictions of precipitation. As 
mentioned before, we have fitted the model to the first 720 time periods from 
December 2008 to February 2009. From this we obtain posterior distributions 
for the primary parameters. Predictions for the time periods in March that 
were set aside are obtained as described in the following. 

Ideally, one would run the full MCMC algorithm at each time point, 
including all data up to the point, and obtain predictive distributions from 
this. However, since this is rather time consuming, we make the following 
approximation. We assume that the posterior distribution of the primary 
parameters given Yi ■ t = {Yi , . . . , Yj } is the same for all t > 720. That is, we 
neglect the additional information that the observations in March give about 
the primary parameters. In practice, this means that posterior distributions 
of the primary parameters are calculated only once, namely, on the data set 
from December 2008 to February 2009. 

For each time t > 720, we make up to 8 steps ahead forecasts. That 
is, we sample from the predictive distribution of Y^^^, k = 1, . . . ,8, given 
Yi;t = {Yi,...,Yj} and given the posterior of the primary parameters 
based on the data from December 2008 to February 2009. Since the NWP 
produces forecasts for the three meteorological covariates once a day, for 
each prediction time t + k, the forecasts made at 0:OOUTC of the same day 
are used. Sampling from the predictive distribution consists of imputing the 
augmented data W and sampling from the latent process ^. These two steps 
are done as described in Section 3. To generate one sample from the predic- 
tive distribution takes around 3.5 seconds on an AMD Athlon(tm) 64 X2 
Dual Core Processor 5600+ with a 2900 MHz CPU clock rate. We use 200 
samples to characterize each predictive distribution. 

The assumption that the posterior of the primary parameters does not 
change may be questionable over longer time periods and when one moves 
away from the time period from which data is used to obtain the poste- 
rior distribution. But since all our data lies in the winter season, we think 
that this assumption is reasonable. If longer time periods are considered, 
one could use sliding training windows or model the primary parameters as 
evolving dynamically over time. One can also investigate how the predictive 
performance deteriorates with increasing lags between predictions and last 
time point from which data is used to fit the model. 

In addition to the separable model and the nonstationary anisotropic 
drift model, we fit a model with no autoregressive term, that is, with i;^ = 0. 
Further, to assess how much information stems from the three meteoro- 
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logical covariates (temperature, dew point and specific humidity) and how 
much from the dynamic spatio-temporal model, we also fit the nonstationary 
anisotropic drift model without including these covariates. For each model, 
we calculate pointwise predictions for the individual stations and also pre- 
dictions for the areal average. The latter are obtained using the Voronoi 
tessellation as described in Section 3.2. 

In order to asses the performance of the probabilistic predictions, we use 
the continuous ranked probability score (CRPS) [Matheson and Winkler 
(1976)]. The CRPS is a strictly proper scoring rule [Gneiting and Raftery 
(2007)] that assigns a numerical value to probabilistic forecasts and assesses 
calibration and sharpness simultaneously [Gneiting, Balabdaoui and Raftery 
(2007)]. It is defined as 

/oo 
{F{x)-l{y^,}fdX, 
-oo 

where F is the predictive cumulative distribution function, y is the observed 
realization, and 1 is an indicator function. It can be equivalently calculated 
as 

(30) CRPS(F, y) = Ef\Y -y\- \Ef\Y - Y'\, 

where Y and Y' are independent random variables with distribution F. If a 
sample Y^^^ , ■ ■ ■ , Y^"^') from F is available, it can be approximated by 

i=l *ij=l 

In Figure 5 the average CRPS of the pointwise predictions and the areal 
predictions are plotted versus lead times. In the left plot, the mean is taken 
over all stations and time periods, whereas the areal version is an average 
over all time periods. Predictions Y^+^, k = 1, . . . , 8, for the next 8 time steps 
are made at each time point t. We recall that the NWP model produces 
predictions for 8 consecutive periods once a day at midnight. For simplicity, 
potential diurnal variation in the accuracy of the predicted covariates is 
ignored. 

We see that the nonstationary anisotropic drift model ("ConvAR") has 
clearly the best performance among the three models. In particular, the non- 
separable convolution based model performs better than the simpler sepa- 
rable spatio-temporal model ("SAR"). Not surprisingly, the model without 
temporal dependency ("NoAR") performs worse than the other two mod- 
els. Comparing the "ConvAR" model, the nonstationary convolution model 
without covariates ("ConvAR No Cov"), and the "NoAR" model, we see 
that the main source of predictive performance at small lead times is not 
the covariates but the dynamic spatio-temporal model. In the areal case. 




Fig. 5. Comparison of statistical models. The continuous ranked probability score 
(CRPS) of forecasts versus number of consecutive time periods for which predictions are 
made is shown. On the left are CRPSs of station specific forecasts and on the right are 
CRPSs of areal forecasts. "NoAR" denotes the model without an autoregressive term, 
"SAR" the one with a separable covariance structure, and "ConvAR" the convolution 
based nonstationary anisotropic drift model. All three models include the covariates de- 
scribed in Section 4-1- A convolution based model without including covariates ("ConvAR 
No Gov") is also fitted. The unit of the CRPS is mm. 



the nonstationary convolution model without covariates even outperforms 
the simple autoregressive model including covariates at small lead times. 
With increasing lead time, the meteorological covariates contribute more to 
the predictive performance and the dynamic spatio-temporal model becomes 
less important. 

We also compare the performance of the predictions from the nonstation- 
ary anisotropic drift model with predictions obtained from the NWP model. 
Since the NWP model produces deterministic forecasts, we use the mean 
absolute error (MAE). In order to make the comparison fair, we first reduce 
the statistical distributional forecast to a point forecast by taking the me- 
dian [see Gneiting (2011) on why this is a reasonable choice]. As mentioned, 
the NWP model produces predictions once a day starting at 0:OOUTC. Pre- 
dictions are then made for eight consecutive time periods corresponding to 
24 h ahead. This means that the time of day also corresponds to the lead 
time. This is in contrast to the above comparison of the different statistical 
models where 8 step ahead predictions were made at all time periods. 

In Figure 6 the mean absolute error (MAE) of forecasts versus lead time, 
or time of day is shown. In addition, in Table 2 we report MAEs averaged 
over all lead times. Note that there is one particular day (March 24) when 




Fig. 6. Comparison of statistical and NWP model. The mean absolute error (MAE) of 
forecasts versus lead time is shown. Lead time also corresponds to the time of day. The 
left panel shows MAEs of station specific forecasts averaged over time and the stations, 
and the right panel shows MAEs of areal forecasts averaged over time. "ConvAR" denotes 
the convolution based nonstationary anisotropic drift model and "NWP" the NWP model. 
The bold lines show the results when excluding March 24, 2009. The unit of the MAE is 
mm. 



heavy rainfall occurred shortly after 0:OOUTC. We report results including 
(thin lines) and excluding (bold lines) this day. 

Table 2 shows that overall the statistical model outperforms the NWP 
on a stationwise base. When considering the areal average, the two models 
perform similarly. Depending on whether March 24 is included or not, the 
NWP or the statistical model has a slightly lower average MAE. 

Furthermore, Figure 6 shows that March 24 considerably affects the per- 
formance of the one- and two-step ahead predictions of the statistical model 
as well as the stationwise performance of the NWP model. When exclud- 
ing this day, the corresponding MAEs are considerably lower. This shows a 

Table 2 

Comparison of statistical and NWP model. The mean absolute error (MAE) 
averaged over all days and lead times is reported. "ConvAR" denotes 
the convolution based nonstationary anisotropic drift model and 
"NWP" the NWP model. The unit of the MAE is mm 

ConvAR NWP Areal ConvAR Areal NWP 



March 2009 0.41 0.46 0.35 0.32 

Excluding March 24 0.36 0.43 0.29 0.31 
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typical behavior of our model and statistical models in general: they per- 
form well when, at the time of prediction, the major phenomena (advective 
fronts) are already observable. In this case, the spatio-temporal statistical 
model can extrapolate the space-time dynamics of the rainfall process into 
the future. 

Earlier studies have shown that nowcasting methods, including statistical 
approaches, perform usually better at short lead times (up to one day), while 
NWP have higher predictive skills at medium ranges [see Kober et al. (2012) 
or Little, McSharry and Taylor (2009)]. Our results are in line with these 
findings in the sense that all lead times used in our application are still in 
the range of what is considered "short" lead times. However, our model is 
not just based on past precipitation observations but also on other predicted 
meteorological variables. 

5. Conclusions. A hierarchical Bayesian spatio-temporal model is pre- 
sented. Incorporating physical knowledge, the dynamic model is nonsta- 
tionary, anisotropic, and allows for nonseparable covariance structures. It 
incorporates a drift term that depends on a wind vector. At the data stage, 
the model determines the probability of rainfall and the rainfall amount 
distribution together. The model is fitted using Markov chain Monte Carlo 
(MCMC) methods and applied to obtain short term precipitation forecasts. 
It performs better than a separable, stationary and isotropic model, and it 
performs comparably to a deterministic numerical weather prediction model 
and has the advantage that it quantifies prediction uncertainty. 

Even though we have applied the model to prediction of precipitation, it 
can also be used to predict or interpolate other meteorological quantities of 
interest. 

Future research could focus on adapting the model so that in can be ap- 
plied to spatially highly resolved data. Using Markov random fields [Rue 
and Held (2005), Lindgren, Rue and Lindstrom (2011)] for the innovation 
process St might be a potential direction. Alternatively, a dimension reduc- 
tion approach could be examined; cf. Banerjee et al. (2008). For instance, 
Sigrist, Kiinsch and Stahel (2012) approximate an advection-diffusion SPDE 
to cope with large data sets. Further, the model can be extended by addition- 
ally relaxing some assumptions. For instance, the parameters o"^, 0, po, pi 
and A were assumed to be constant over time. Assuming periodicity, Fourier 
harmonics could be used to model parameters that vary seasonally during 
the year. Alternatively, the parameters could evolve dynamically over time 
according to an equation of the form = ^t~i + A^(0,o"|). 
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